Southeast Asia


Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

arXiv.org Artificial Intelligence

Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further by exploring the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures: the generated images often fail to capture the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M culturally relevant SEA images, a collection more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
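
As a rough illustration of how crawled images might be screened for cultural relevance, the sketch below scores each image against text prompts with a CLIP-style model. The model name, prompts, and threshold are assumptions for illustration only, not the SEA-VL pipeline itself.

```python
# Hypothetical sketch: score crawled images for SEA cultural relevance with CLIP.
# The checkpoint, prompts, and 0.5 threshold are illustrative assumptions,
# not the SEA-VL filtering setup.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a photo of a Southeast Asian cultural scene",    # positive class
    "a generic stock photo with no cultural context"  # negative class
]

def is_culturally_relevant(path: str, threshold: float = 0.5) -> bool:
    image = Image.open(path).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    # Probability that the image matches the "cultural scene" prompt.
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return probs[0, 0].item() >= threshold
```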


SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia

arXiv.org Artificial Intelligence

This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed from real-world scenarios in SEA regions. SeaExam draws on regional educational exams to form a comprehensive dataset covering subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench discern LLM performance on SEA language tasks more effectively than their translated counterparts, highlighting the importance of using real-world queries to assess the multilingual capabilities of LLMs.
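
A minimal sketch of how an exam-style benchmark such as SeaExam could be scored is shown below; the item schema and the query_model stub are assumptions for illustration, not the benchmark's actual evaluation harness.

```python
# Illustrative scoring loop for an exam-style multiple-choice benchmark.
# The ExamItem schema and query_model() stub are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class ExamItem:
    question: str
    options: dict[str, str]   # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str               # gold option label, e.g. "B"

def query_model(prompt: str) -> str:
    """Stub: replace with a call to the LLM under evaluation."""
    return "A"

def accuracy(items: list[ExamItem]) -> float:
    correct = 0
    for item in items:
        choices = "\n".join(f"{k}. {v}" for k, v in item.options.items())
        prompt = f"{item.question}\n{choices}\nAnswer with a single letter."
        # Count a hit if the model's reply starts with the gold option label.
        if query_model(prompt).strip().upper().startswith(item.answer):
            correct += 1
    return correct / len(items)
```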


AI can drive business growth in Southeast Asia. But some big challenges remain

ZDNet

Southeast Asian digital economies are projected to expand to $263 billion in gross merchandise value (GMV) this year -- and artificial intelligence (AI) is poised to fuel further growth, if greater business value can be extracted from the technology. According to the latest iteration of the e-Conomy SEA report, the region's digital economy will also be fueled by increasing user sophistication and the growing importance of cybersecurity. Jointly released by Singapore's investment firm Temasek, Google, and Bain & Company, the study taps insights and analyses from Temasek, Bain, and Google Trends, alongside data from research partners, expert interviews, and industry sources. The report covers six Southeast Asian markets: Singapore, Indonesia, Malaysia, Thailand, Vietnam, and the Philippines. The study includes a projection on the region's profitability, which is expected to hit $11 billion in 2024, up 24% from $9 billion in 2023 and 101% from $4 billion in 2022.


A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia

arXiv.org Artificial Intelligence

Existing work on bias in pretrained language models (PLMs) focuses on bias evaluation and mitigation, but fails to tackle the question of bias attribution and explainability. We propose a novel metric, the bias attribution score, which draws from information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the utility of this metric by applying it to multilingual PLMs, including models from Southeast Asia that have not yet been thoroughly examined in the bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses also reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping, among other discursive categories, suggesting that these are topics where PLMs strongly reproduce bias from pretraining data and where PLMs should be used with more caution.
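
The paper's exact formulation of the bias attribution score is not reproduced here, but the sketch below illustrates the general idea of token-level attribution: measuring how much each token raises a model's likelihood of producing a given (potentially biased) sentence via a leave-one-token-out comparison. The model choice and scoring scheme are illustrative assumptions, not the proposed metric.

```python
# Hedged sketch of leave-one-token-out attribution for a sentence's likelihood.
# This is NOT the paper's bias attribution score; it only illustrates how
# token-level contributions to a model's preference can be quantified.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # placeholder; any causal LM works
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def log_prob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss        # mean negative log-likelihood per token
    return -loss.item() * (ids.shape[1] - 1)   # approximate total log-probability

def attribution(sentence: str) -> dict[str, float]:
    words = sentence.split()
    base = log_prob(sentence)
    scores = {}
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        # Higher score: removing the word lowers the likelihood more,
        # i.e. the word contributes more to the model producing the sentence.
        scores[w] = base - log_prob(reduced)
    return scores
```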


Samsung to cut thousands of jobs amid struggles in AI market

The Japan Times

Samsung Electronics is laying off workers in Southeast Asia, Australia and New Zealand as part of a plan to reduce global headcount by thousands of jobs, according to people familiar with the situation. The layoffs could affect about 10% of the workforces in those markets, although the numbers for each subsidiary may vary, said one of the people, who asked not to be named because the matter is private. Job cuts are also planned for other overseas subsidiaries and could reach 10% in certain markets, the person said. The South Korean company has about 147,000 staff overseas, more than half of its total workforce of more than 267,800, according to its latest sustainability report.


5G to hit 5.6 billion subscribers in 2029, but 4G will remain dominant in three regions

ZDNet

The number of 5G subscriptions is expected to hit 5.6 billion worldwide by the end of 2029, when 5G will be the dominant network in all regions except for three. Globally, 5G is projected to account for 60% of all mobile subscriptions by the end of 2029, according to the latest Ericsson Mobility Report. It also will drive 75% of overall mobile data traffic, which will see a compound annual growth rate of 20% through 2029. The report noted that annual mobile data traffic growth is projected to slow at varying rates across different regions during the forecast period, depending on local market dynamics. Noting that traffic growth can be volatile, Ericsson pointed to factors such as global macroeconomic changes, growth in fixed wireless access (FWA) connections, and subscriber migration to later generations in markets such as India, Southeast Asia, Latin America, and Africa.


How Good are LLMs at Relation Extraction under Low-Resource Scenario? Comprehensive Evaluation

arXiv.org Artificial Intelligence

Relation Extraction (RE) serves as a crucial technology for transforming unstructured text into structured information, especially within the framework of Knowledge Graph development. Its importance is underscored by its essential role in various downstream tasks. Besides conventional RE methods based on neural networks and pre-trained language models, large language models (LLMs) are also utilized in RE research. However, on low-resource languages (LRLs), both conventional and LLM-based methods perform poorly due to data scarcity. To this end, this paper constructs low-resource relation extraction datasets in 10 LRLs across three regions (Central Asia, Southeast Asia, and the Middle East). The corpora are built by translating the original publicly available English RE datasets (NYT10, FewRel, and CrossRE) with an effective multilingual machine translation model, after which we use language perplexity (PPL) to filter out low-quality examples from the translated data. Finally, we conduct an empirical study to validate the performance of several open-source LLMs on these generated LRL RE datasets.
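
A hedged sketch of the perplexity-based filtering step might look like the following; the scoring model and the cutoff are placeholders, since the paper's exact setup is not specified in this summary.

```python
# Illustrative perplexity (PPL) filter for machine-translated RE examples.
# The scoring model and max_ppl threshold are assumptions for this sketch.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # swap in a model covering the target LRL
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss   # mean negative log-likelihood per token
    return math.exp(loss.item())

def filter_translations(sentences: list[str], max_ppl: float = 200.0) -> list[str]:
    # Keep only translations whose perplexity is below the cutoff,
    # i.e. sentences the language model finds reasonably fluent.
    return [s for s in sentences if perplexity(s) <= max_ppl]
```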


Tech giants start to treat Southeast Asia like the next big thing

The Japan Times

Long considered a tech hinterland, Southeast Asia is fast emerging as a center of gravity for the industry. The CEOs of Apple, Microsoft and Nvidia are among the industry chieftains who've swung through the region in recent months, committing billions of dollars in investment and holding forth with heads of state from Indonesia to Malaysia. Amazon just this week took over a giant conference hall in downtown Singapore to unfurl a $9 billion investment plan before a thousands-strong audience cheering and waving glow sticks. After decades of playing second fiddle to China and Japan, the region of about 675 million people is drawing more tech investment than ever. For data centers alone, the world's biggest companies are set to splurge up to $60 billion over the next few years as Southeast Asia's young populations embrace video streaming, online shopping and generative AI.


These pioneering companies are working together to develop models for AI

ZDNet

Five telcos have inked a pact to build large language models (LLMs) customized to meet the needs of their industry and to support multiple languages. The effort will be driven by a new joint venture to be established by the telcos: South Korea's SK Telecom, Germany's Deutsche Telekom, Abu Dhabi's e& Group, Singapore's Singtel, and Japan's SoftBank. Collectively, the telcos have a global customer base of 1.3 billion across 50 markets, according to a joint statement released Monday. The new entity will be set up this year and will develop LLMs designed specifically to enhance telcos' engagement with customers through digital assistants and chatbots. These artificial intelligence (AI) models will be optimized for languages used in the telcos' domestic markets, including German, Japanese, Arabic, Korean, and English, as well as others, such as Bahasa Indonesia, so the models can be rolled out in Southeast Asia.


SeaLLMs -- Large Language Models for Southeast Asia

arXiv.org Artificial Intelligence

Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.
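
As a rough sketch of what continued pre-training with an extended vocabulary involves, the snippet below adds new subword tokens to a Llama-style tokenizer and resizes the embedding matrix before further training; the base checkpoint and token list are placeholders, not the SeaLLMs recipe.

```python
# Hedged sketch: extend a Llama-style tokenizer with new SEA-language tokens and
# resize the embedding matrix before continued pre-training. The checkpoint and
# token list below are placeholders, not the actual SeaLLMs vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"          # assumed base checkpoint
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical new subword tokens mined from SEA-language corpora
# (Thai, Lao, and Khmer examples shown here purely for illustration).
new_tokens = ["ประเทศไทย", "ພາສາລາວ", "ភាសាខ្មែរ"]
num_added = tok.add_tokens(new_tokens)

if num_added > 0:
    # New embedding rows are randomly initialized; continued pre-training
    # on regional corpora then adapts them to the extended vocabulary.
    model.resize_token_embeddings(len(tok))
```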